Chat Completions
Use llm.respond(...) to generate completions for a chat conversation.
The following snippet shows how to obtain the AI's response to a quick chat prompt.
import lmstudio as lms
model = lms.llm()
print(model.respond("What is the meaning of life?"))
The following snippet shows how to stream the AI's response to a chat prompt, displaying text fragments as they are received (rather than waiting for the entire response to be generated before displaying anything).
import lmstudio as lms
model = lms.llm()
for fragment in model.respond_stream("What is the meaning of life?"):
    print(fragment.content, end="", flush=True)
print()  # Advance to a new line at the end of the response
See the Cancelling a Prediction section for how to cancel a prediction in progress.
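As a rough sketch of what that looks like (assuming the cancel() method on the prediction stream described in that section), you can stop a streaming prediction once some condition is met:
import lmstudio as lms

model = lms.llm()
prediction_stream = model.respond_stream("Tell me a very long story.")
received = 0
for fragment in prediction_stream:
    print(fragment.content, end="", flush=True)
    received += len(fragment.content)
    if received > 500:
        # Assumption: cancel() asks the server to stop the ongoing prediction;
        # iteration then finishes once the remaining fragments are drained.
        prediction_stream.cancel()
print()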
First, you need to get a model handle. This can be done using the top-level llm convenience API, or the model method in the llm namespace when using the scoped resource API.
For example, here is how to use Qwen2.5 7B Instruct.
import lmstudio as lms
model = lms.llm("qwen2.5-7b-instruct")
There are other ways to get a model handle. See Managing Models in Memory for more info.
The input to the model is referred to as the "context". Conceptually, the model receives a multi-turn conversation as input, and it is asked to predict the assistant's response in that conversation.
import lmstudio as lms
# Create a chat with an initial system prompt.
chat = lms.Chat("You are a resident AI philosopher.")
# Build the chat context by adding messages of relevant types.
chat.add_user_message("What is the meaning of life?")
# ... continued in next example
See Working with Chats for more information on managing chat context.
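For instance, if you already have a conversation history as a list of role/content messages, it can be turned into a chat context directly (a sketch, assuming the Chat.from_history constructor described in that section):
import lmstudio as lms

# Assumption: Chat.from_history builds a Chat from role/content message dicts
chat = lms.Chat.from_history({
    "messages": [
        {"role": "system", "content": "You are a resident AI philosopher."},
        {"role": "user", "content": "What is the meaning of life?"},
    ]
})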
You can ask the LLM to predict the next response in the chat context using the respond() method.
# The `chat` object is created in the previous step.
result = model.respond(chat)
print(result)
You can pass in inferencing parameters via the config keyword parameter on .respond().
prediction_stream = model.respond_stream(chat, config={
    "temperature": 0.6,
    "maxTokens": 50,
})
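The same config keyword also works with the non-streaming .respond() call, reusing the fields shown above:
result = model.respond(chat, config={
    "temperature": 0.6,
    "maxTokens": 50,
})
print(result)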
See Configuring the Model for more information on what can be configured.
You can also print prediction metadata, such as the model used for generation, number of generated tokens, time to first token, and stop reason.
# After iterating through the prediction fragments,
# the overall prediction result may be obtained from the stream
result = prediction_stream.result()
print("Model used:", result.model_info.display_name)
print("Predicted tokens:", result.stats.predicted_tokens_count)
print("Time to first token (seconds):", result.stats.time_to_first_token_sec)
print("Stop reason:", result.stats.stop_reason)
Putting these pieces together, the following example runs a simple multi-turn chat loop in the terminal, streaming each response and appending completed messages to the chat history.
import lmstudio as lms
model = lms.llm()
chat = lms.Chat("You are a task focused AI assistant")
while True:
    try:
        user_input = input("You (leave blank to exit): ")
    except EOFError:
        print()
        break
    if not user_input:
        break
    chat.add_user_message(user_input)
    prediction_stream = model.respond_stream(
        chat,
        on_message=chat.append,
    )
    print("Bot: ", end="", flush=True)
    for fragment in prediction_stream:
        print(fragment.content, end="", flush=True)
    print()
Long prompts will often take a long time to first token, i.e. it takes the model a long time to process your prompt. If you want to receive updates on the progress of prompt processing, you can provide a callback to respond that receives a float from 0.0 to 1.0 representing prompt processing progress.
import lmstudio as lms
llm = lms.llm()
response = llm.respond(
    "What is LM Studio?",
    on_prompt_processing_progress=lambda progress: print(f"{progress*100}% complete"),
)
In addition to on_prompt_processing_progress, the other available progress callbacks are:
on_first_token: called after prompt processing is complete and the first token is being emitted. Does not receive any arguments (use the streaming iteration API or on_prediction_fragment to process tokens as they are emitted).
on_prediction_fragment: called for each prediction fragment received by the client. Receives the same prediction fragments as those produced by iterating over the stream.
on_message: called with an assistant response message when the prediction is complete. Intended for appending received messages to a chat history instance.
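As a sketch of how these callbacks fit together (behaviour as described above; the prompt is just a placeholder), they can all be supplied to a single respond() call:
import lmstudio as lms

model = lms.llm()
chat = lms.Chat("You are a task focused AI assistant")
chat.add_user_message("Explain what a context window is.")

result = model.respond(
    chat,
    on_first_token=lambda: print("(prompt processed, streaming response)"),
    on_prediction_fragment=lambda fragment: print(fragment.content, end="", flush=True),
    # Append the completed assistant message to the chat history
    on_message=chat.append,
)
print()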